Lab 07

Using Data Structures to make a Concordance
Due by 6pm on Tuesday, October 31

The purpose of this lab is to:

 

Getting Started

You will want to download the following test files for your program:

Prufrock.txt          (T.S. Eliot's The Love Song of J. Alfred Prufrock)
Jabberwocky.txt   (The Lewis Carroll poem)
FoxInSocks.txt     (by Dr. Seuss)
Test.txt                 (a file for testing your line numbering)

Concordances.

In this lab you will create a concordance. What is a concordance? It is an index to the words of a text or of a body of texts. For example, if you are writing an essay about Shakespeare's view of kingship, you might want to look at the instances in his plays where the word "king" is used. There are a lot of these instances. You can find them all by looking at a concordance to Shakespeare -- look up the word "king" and you will get references, by Play, Act, Scene and Line Number, to every use of this word in every one of Shakespeare's plays. The Oberlin College library has concordances to Shakespeare and Donne and Chaucer and Dante and Vergil and Plato and even to Joyce's Finnegans Wake. It has several concordances to the Bible, the Qur'an and the Guanzi. In fact, the library has more than 150 books whose title starts "A concordance to ..."

One of the issues that the creator of a concordance faces is how to refer to a specific use of a word. We are going to take the easy way out and just use line numbers. This is great for making a concordance to a single poem, and less practical for a novel. Here is one small portion of the output of our concordance for The Love Song of J. Alfred Prufrock by T.S. Eliot:

etherized 3
evening 2 17 77
evenings 50
eyes 55 56

So the word "etherized" appears on line 3, "evening" appears 3 times, on lines 2, 17 and 77, and so forth. In this lab you will write a program that asks the user for the name of a text file, and then prints a concordance of the text in that file.

Data Structures: Lists and Dictionaries

The interesting parts of this lab are the structures we use to create the concordance. We need to store line numbers, possibly one and possibly many, for each word in the text. This is a problem of association -- we want to associate line numbers with words. Dictionaries are the structures to use for this. Dictionaries are designed to efficiently associate one datum with another. In dictionary terminology, keys are the things we use to look up values. The keys act like indexes. For our situation the words of the text will be our keys; the line numbers on which a word is found will be the value associated with that word. The line numbers themselves should be sequential -- we want to store them in increasing order. Lists are good for this, and are easy to use. Altogether, our concordance will be stored as a dictionary, where the keys are words (strings) and the values are lists of line numbers.
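As a small illustration of this structure, here is a hand-built fragment of a concordance using the four Prufrock entries shown earlier. The variable names are only for illustration; your program will build its dictionary from the file rather than typing it in.

```python
# A hand-built fragment of a concordance: keys are words (strings),
# values are lists of line numbers in increasing order.
concordance = {
    "etherized": [3],
    "evening": [2, 17, 77],
    "evenings": [50],
    "eyes": [55, 56],
}

# Looking up a key gives back the list of line numbers for that word.
evening_lines = concordance["evening"]
print(evening_lines)         # the list of lines where "evening" appears
print(len(evening_lines))    # how many times "evening" appears
```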

We have talked about both dictionaries and lists in class. Here are reminders of how these structures work in Python:

Dictionaries

  Ages = {}                        # sets Ages to be an empty dictionary
  Ages["Hermione"]              # returns the value associated with "Hermione", 
                                     # presumably her age.  Throws an error if 
                                     # "Hermione" is not a key of Ages.
  Ages["Hermione"] = 18          # Makes "Hermione" a key and associates 18 with it.
  del Ages["Hermione"]           # removes key "Hermione" and the value associated with it.
  Ages.keys()                     # returns a "view" of the keys of Ages. You can 
                                     # treat this like a list of the keys.
  len(Ages)                        # returns the number of keys in Ages
  for person in Ages:            # Iterates over the keys in Ages.

 

Lists

  Numbers = []                     # sets Numbers to be an empty list
  Numbers[3]                       # returns the fourth entry of Numbers. 
                                     # Throws an error if Numbers does not have 4 entries.

  Numbers.append(18)              # adds entry 18 onto the end of Numbers

  del Numbers[3]                 # removes the fourth entry of Numbers, shifting
                                     # later entries down.

  for x in Numbers:               # Iterates over the list Numbers.

 

Finally, in this lab we will make repeated use of several methods of the String class:

String Methods

In these examples assume s is a string.

  s.strip()                       # returns a string like s only with leading and
                                  # trailing white space deleted
  "  bob    ".strip()             # "bob"

  s.strip(p)                      # Here p is a string of punctuation characters to be
                                  # deleted. This returns a string like s, only with
                                  # all of the letters of p, in any order, deleted
                                  # from the front and back of s.
  "(bob!*!!".strip( "(!*" )       # "bob"
  "isn't".strip( "!'?" )          # "isn't"

  s.split( )                      # returns a list of the "words" in s, using
                                  # white space as the separator between words
  "The time is now!".split( )     # ["The", "time", "is", "now!"]

  s.split( delim )                # returns a list of the "words" in s, using the
                                  # string delim as the separator between words.
  "8/10/2015".split( "/" )        # ["8", "10", "2015"]

  s.lower()                       # returns a copy of s with all letters converted to
                                  # lower-case
  s.upper()                       # returns a copy of s with all letters converted to
                                  # upper-case
  "aBC(De)fG".lower()             # "abc(de)fg"

Part 2 - Your Program

Your program should ask the user for the name of one file. That file should be in the same folder as your program. Most text files have names that end in ".txt", so be sure to type this as part of the file name when you are running the program. Your program should open this file and read it one line at a time (a for-loop does this nicely), counting the line numbers (only count the non-blank lines; the first non-blank line should be numbered 1). Each word in the line should be stripped of punctuation marks, converted to lower-case, and added to your concordance with its line number. After the entire file is processed, you should print all of the words that are keys of your concordance, in alphabetical order, along with the list of line numbers for each word. Finally, at the end you should print the number of lines in the file and the number of unique words found.

For example, consider the file Test.txt:

one!!

Two Two
!!!! --
four four four four

five five Five! 'five five

Here is the output we want from this file:

five 5 5 5 5 5
four 4 4 4 4
one 1
two 2 2
I found 5 lines containing 4 unique words.

The word "one" appears once on the first line of the file; "two" appears twice on the line numbered 2 (we ignored the blank line between 1 and 2). There is a line 3, but the "words" on it consist only of punctuation characters so they are never added to the concordance.

There are 3 issues to consider with this program:

  1. How to read the file line-by-line, counting the line numbers
  2. How to get the individual words from a line, strip off their punctuation, and convert them to lower-case
  3. How to add the words and their line numbers to the concordance.

Reading the file, counting the lines

Remember that we have to open the file to connect your program to it:

F = open( <name>, "r")

After that the for loop:

for line in F:
    ....

will process the file one line at a time. The variable line is a string representation of one line of the file, including the "\n" character at the end of the line. Even an otherwise empty line has this "\n" character, so the line variable is never an empty string. We only want to increment our line counter if the line isn't empty. The following code will do that:

lineNumber = 0
for line in F:
    strippedLine = line.strip( )
    if strippedLine != "":
        lineNumber = lineNumber+1

This uses the strip( ) method for strings, with no arguments, to remove all white space, including "\n", from the front and the back of the line.
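To see the numbering in action without needing a file on disk, here is a minimal sketch that runs the same loop over the contents of Test.txt held in memory. Using io.StringIO as a stand-in for the opened file, and collecting the results in a list called numbered, are assumptions made only for this illustration.

```python
import io

# The contents of Test.txt, held in memory so the sketch runs without a file.
text = "one!!\n\nTwo Two\n!!!! --\nfour four four four\n\nfive five Five! 'five five\n"
F = io.StringIO(text)   # behaves like an open text file in a for-loop

lineNumber = 0
numbered = []           # (line number, stripped line) pairs, for inspection only
for line in F:
    strippedLine = line.strip()
    if strippedLine != "":
        # Only non-blank lines advance the counter.
        lineNumber = lineNumber + 1
        numbered.append((lineNumber, strippedLine))

for n, s in numbered:
    print(n, s)
```

The two blank lines are skipped, so the line of punctuation is numbered 3 and the last line is numbered 5.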

Handling individual words

Python has a very handy method for dividing a string into words. If s is a string, then s.split( ) returns a list of the "words" of s, using white space as the delimiter between words. It is also possible to give split( ) an argument, which it uses as the delimiter between words, but we want white space. This means we can use a loop such as

for word in strippedLine.split( ):

and variable word will iterate through all of the "words" on strippedLine. Now, Python doesn't know English. It has no sense of what are English words; it just groups any sequence of symbols delimited by white space as a word. In text documents we frequently have punctuation attached to actual words with no intervening spaces, as in "Pow!", "where?" and "(parenthetical comment)". We don't want to enter "(parenthetical" as one of the words in our concordance; we want "parenthetical". Fortunately, Python gives an easy solution to this. The String method strip( ), which we are using to eliminate the "\n" character at the end of lines, can also be used to eliminate punctuation. Above we used it with no arguments, in which case it removes white space from the front and back of a string. If we give strip( ) a string argument, it removes all of the characters of this string, in any order, from the front and back of the string to which it is applied. We need to remove punctuation, so we build up a string that has every punctuation character: punc = "()[]{};:',." and so forth. Then

word2 = word.strip( punc )

will remove all of the characters of punc from the front and back of word without reaching inside to delete the apostrophe from "isn't". This is exactly what we want.

Finally, we want to equate upper-case and lower-case versions of the same word: as far as our concordance is concerned "When" and "when" are the same word. The string method s.lower( ) handles this by returning a copy of s with all letters converted to lower-case.
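Putting these three steps together (split, strip the punctuation, lower-case), a minimal sketch might look like the following. The helper name clean_word and the particular punctuation string are assumptions for illustration; the punctuation string here is deliberately incomplete, and your own will need more characters.

```python
# An assumed, incomplete set of punctuation characters; extend it as needed.
punc = "()[]{};:',.!?\"-"

def clean_word(word):
    """Strip punctuation from both ends of word, then convert to lower-case."""
    return word.strip(punc).lower()

line = 'Pow! "Where?" (parenthetical comment) isn\'t'
cleaned = [clean_word(w) for w in line.split()]
print(cleaned)
```

Notice that the apostrophe inside "isn't" survives, because strip( ) only works on the two ends of the string.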

Handling the dictionary

As we have said, the Concordance is a dictionary. At the start of your program you will create an empty dictionary for your concordance:

Concordance = { }

Each time you come across a word you need to know if it is already a key in your concordance:

if word in Concordance.keys():
    ....

Suppose word is found on line L. If word is already a key, you can add this new occurrence with

Concordance[word].append(L)

On the other hand, if word is not yet a key you can make it one with

Concordance[word]=[L]

You must work with the dictionary in this way. If you try to say

Concordance[word].append(L)

when word is not a key, your program will crash, for there is no Concordance[word] list to append onto.
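Here is one possible sketch of the check-then-add pattern described above, wrapped in a function. The name AddWord and its parameters follow the function list suggested later in this lab; the sample words and line numbers are made up for the demonstration.

```python
def AddWord(word, lineNumber, C):
    """Record that word was found on lineNumber in the concordance C."""
    if word in C.keys():
        # The word already has a list of line numbers; extend it.
        C[word].append(lineNumber)
    else:
        # First occurrence: start a new one-element list.
        C[word] = [lineNumber]

Concordance = {}
for word, L in [("evening", 2), ("etherized", 3), ("evening", 17)]:
    AddWord(word, L, Concordance)

print(Concordance)
```

Because the file is read from top to bottom, appending keeps each list of line numbers in increasing order with no extra sorting.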

 

After processing the entire text file you need to print all of the words in alphabetical order, followed by their line numbers. You can't directly sort a key structure; you need to first convert it to a list and then sort it:

words = list( Concordance.keys() )
words.sort( )
for word in words:
    print( word, end=" ")
    for lineNumber in Concordance[word]:
        <print lineNumber in a nice way>

You should print the list of line numbers in a nice way. If instead of using a loop you just say

for word in words:
    print(word, Concordance[word])

your line numbers will be printed inside square brackets, which is ugly. You can decide for yourself how to handle very long lists of line numbers. It would be nice to print them with a certain number per line, but that is up to you.
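One way (among several) to avoid the square brackets is to turn each line number into a string and join the pieces with spaces. The sketch below assumes a hypothetical helper named format_entry and a small made-up concordance; it does not attempt to wrap long lists across several lines.

```python
def format_entry(word, C):
    """Return word followed by its line numbers, separated by single spaces."""
    numbers = " ".join(str(n) for n in C[word])
    return word + " " + numbers

# A small made-up concordance, just for the demonstration.
Concordance = {"eyes": [55, 56], "evening": [2, 17, 77], "etherized": [3]}

words = list(Concordance.keys())
words.sort()
for word in words:
    print(format_entry(word, Concordance))
```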

Your program

The design of your program is up to you, but you should certainly divide the work to be done among several functions. One way to do this would be to use the following functions:

RemovePunctuation(s):
This function returns a new string that has the letters of s translated to lower-case, with all of the punctuation removed.
AddWord(word, lineNumber, C):
This handles the work of adding the fact that the given word was found on the given lineNumber to the dictionary C
PrintEntry(word, C):
This handles one word of the output. C is the concordance, so C[word] is the list of line numbers on which word occurs.
main( ):
This gets the file name, opens the file, and has a loop that reads the file one line at a time and splits each line into words. RemovePunctuation( ) prepares the word for adding to the concordance; AddWord( ) actually does the addition. Finally, a loop over the keys of the concordance calls PrintEntry( ) on each word to handle the output.

Testing your work

Here are several files that will help you test out your program:

Test.txt                 (a file for testing your line numbering)
Prufrock.txt          (T.S. Eliot's The Love Song of J. Alfred Prufrock)
Jabberwocky.txt   (The Lewis Carroll poem)
FoxInSocks.txt     (by Dr. Seuss)

File Test.txt should give you the following output:

five 5 5 5 5 5
four 4 4 4 4
one 1
two 2 2
I found 5 lines containing 4 unique words.

If you get something else, there is a problem either with your line numbering or with the way you are stripping punctuation. The other files are mainly useful for checking punctuation; there are many different punctuation characters used in these files and you should remove all of them. Look carefully at your output. If you see what appears to be a blank word followed by line numbers, it probably comes about in the following way. The split( ) method separates a string into words by using white space as a delimiter, so some "words" might be just sequences of punctuation characters, such as "!!!". When you strip off the punctuation you are left with the empty string. Before you add a word and its line number to the concordance you should check if the word is the empty string; if it is, just don't add it.
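The empty-string check can be sketched as follows, using a few "words" taken from Test.txt. The punctuation string and the list named kept are assumptions for this illustration only.

```python
punc = "!-'"    # just enough punctuation for this small example

kept = []       # the words that would actually go into the concordance
for word in "!!!! -- Five! 'five".split():
    word2 = word.strip(punc).lower()
    if word2 != "":
        # Skip "words" that were nothing but punctuation.
        kept.append(word2)

print(kept)
```

Both "!!!!" and "--" strip down to the empty string and are skipped, which is why line 3 of Test.txt contributes nothing to the concordance.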

If you want to play with your concordance, here are a few additional files you might work with:

Beowulf.txt (translated to modern English by Hall)
DavidCopperfield.txt (all 626 pages of the Dickens novel)
Inferno.txt (the first third of Dante's Divine Comedy, translated by Norton)
KingLear.txt (the Shakespeare play)
Repulic.txt (Plato's "Republic")

Thanks to Project Gutenberg, there are many, many more text files available on the Web.


Handing in your work

You probably have only one file of code. Make sure you have your name at the top in a comment. Also, if you followed the Honor Code in this assignment, make an HonorCode file with the text:

I affirm that I have adhered to the Honor Code in this assignment.

You now just need to electronically hand in your work.

    % cd                # changes to your home directory
    % cd cs150          # goes to your cs150 folder
    % handin            # starts the handin program
                        # class is 150
                        # assignment is 7
                        # file/directory is lab07

    % lshand            # should show that you've handed in something

You can also specify the options to handin from the command line

    % cd ~/cs150                 # goes to your cs150 folder
    % handin -c 150 -a 7 lab07

File Checklist


You should have submitted the following files:
  concordance.py
  HonorCode (with the Honor Pledge)